The simplest form of flow control is conditional execution with if. It takes a logical vector of length 1 and executes the statement if the condition is TRUE.
## [1] "This is tautological!"
## [1] "Same as above, but implicitly"
## [1] "Does it even worth mention it?"
## [1] "5 is less than 10"
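The code behind the output above is not shown here, so the following is only a minimal sketch of how if behaves; the variable names `x` and `message_text` are assumptions made for illustration.

```r
x <- 5  # hypothetical value

# A condition that is literally TRUE: the statement always runs
if (TRUE) print("This is tautological!")

# The same idea with a real comparison and curly brackets
if (x < 10) {
  message_text <- paste(x, "is less than 10")
  print(message_text)
}
```

The curly brackets are optional for a single statement, but they make the extent of the conditional block explicit.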
Tip: As a matter of good programming practice, it is recommended to use curly brackets.
Conditional operations make little sense if nothing happens when the initial condition is not met; in that case we use else.
x <- 8
if (x <= 7) {
  print("x is less than or equal to 7")
} else { # else MUST appear on the same line where the curly bracket closes
  print("x is more than 7")
}
## [1] "x is more than 7"
More complex conditional operations can be created by adding if after the else. For instance:
if (x <= 7) {
  print("x is less than or equal to 7")
} else if (x < 10) {
  print("x is more than 7 but less than 10")
} else { # reached only when x is 10 or more
  print("x is 10 or more")
}
## [1] "x is more than 7 but less than 10"
The statements above only accept conditions of length one. Nonetheless, there are built-in functions that apply the same conditional logic to whole vectors. Try it if you want.
The vectorized conditional function is ifelse; it is going to be truly useful in the future.
## [1] "less than 5" "less than 5" "less than 5"
## [4] "less than 5" "more equal than 5" "more equal than 5"
## [7] "more equal than 5" "more equal than 5" "more equal than 5"
## [10] "more equal than 5"
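A call of roughly this shape would produce a result like the one above. This is a sketch: the input vector `x` and the name `labels` are assumptions, since the original call is not shown.

```r
x <- 1:10  # hypothetical input vector

# ifelse(test, yes, no) evaluates the test element by element
labels <- ifelse(x < 5, "less than 5", "more equal than 5")
print(labels)
```

Unlike if, ifelse returns one value per element of the test, so no explicit loop is needed.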
Loops are used in programming to perform a specific task repeatedly. In this section, we will learn how to create loops in R.
R provides three kinds of loops: repeat, while and for.
for loop
## [1] FALSE
## [1] FALSE
## [1] FALSE
## [1] FALSE
## [1] FALSE
## [1] FALSE
## [1] TRUE
## [1] FALSE
## [1] FALSE
## [1] FALSE
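The loop that printed these values is not shown; one loop that would print a single TRUE at the seventh position looks like the sketch below. The names `values` and `is_seven` are hypothetical.

```r
values <- 1:10                        # hypothetical sequence to iterate over
is_seven <- logical(length(values))   # pre-allocated container for the results

for (i in seq_along(values)) {
  is_seven[i] <- values[i] == 7       # the comparison runs once per element
  print(is_seven[i])
}
```

Pre-allocating the result vector avoids growing it inside the loop, which becomes slow for long sequences.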
while loop
A while loop first checks whether a condition is met; if it is, the body executes, otherwise nothing happens.
temperature <- 10
while (temperature < 18) {
  print(paste0("If the temperature is ", temperature, " C°: Do not swim"))
  temperature <- temperature + 1
}
## [1] "If the temperature is 10 C°: Do not swim"
## [1] "If the temperature is 11 C°: Do not swim"
## [1] "If the temperature is 12 C°: Do not swim"
## [1] "If the temperature is 13 C°: Do not swim"
## [1] "If the temperature is 14 C°: Do not swim"
## [1] "If the temperature is 15 C°: Do not swim"
## [1] "If the temperature is 16 C°: Do not swim"
## [1] "If the temperature is 17 C°: Do not swim"
temperature < 18 is re-evaluated with the variable's current value at every iteration.
while loops have to include a statement that updates the condition (here, an increment); failing to do so will create an… infinite loop!
repeat loop
repeat is often used jointly with stop or break.
## [1] 1
## [1] 2
## [1] 3
## [1] 4
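A repeat loop has no condition of its own, so the exit has to be coded by hand. The sketch below prints 1 to 4 and then leaves the loop with break; the variable `counter` is an assumption.

```r
counter <- 1  # hypothetical counter

repeat {
  print(counter)
  if (counter >= 4) break   # without this line the loop would never end
  counter <- counter + 1
}
```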
repeat for or, while.
rep and replicate represent the basic idea of functions applied to vectors.
## [1] -0.5604756 -0.5604756 -0.5604756 -0.5604756 -0.5604756 -0.5604756
## [7] -0.5604756
## [1] -0.23017749 1.55870831 0.07050839 0.12928774 1.71506499 0.46091621
## [7] -1.26506123
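The difference between the two outputs comes from when the expression is evaluated. A minimal sketch, assuming a single random draw is the expression being repeated (the names `same_draw` and `fresh_draws` are hypothetical):

```r
set.seed(123)

# rep() receives the VALUE of rnorm(1): one draw, copied 7 times
same_draw <- rep(rnorm(1), times = 7)

# replicate() receives the EXPRESSION and evaluates it 7 times
fresh_draws <- replicate(7, rnorm(1))

print(same_draw)
print(fresh_draws)
```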
apply family
apply
Main arguments: X, MARGIN and FUN
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
## [1] 55 60 65 70 75
## [1] 15 40 65 90 115
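The matrix and the two sum vectors above are consistent with the following sketch, where MARGIN = 1 works over rows and MARGIN = 2 over columns. The object names are assumptions.

```r
m <- matrix(1:25, nrow = 5)   # filled column by column

row_sums <- apply(X = m, MARGIN = 1, FUN = sum)  # one value per row
col_sums <- apply(X = m, MARGIN = 2, FUN = sum)  # one value per column

print(m)
print(row_sums)
print(col_sums)
```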
lapply
set.seed(123)
list <- list(e1 = rnorm(100, 5, 1)
             , e2 = rnorm(100, 10, 1)
             , e3 = rnorm(100, 15, 1)
             , e4 = list(rnorm(100, 5, 1) * 100)) # e4 is itself a list, not a numeric vector
lapply(X = list, FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## $e1
## [1] 5.090406
##
## $e2
## [1] 9.892453
##
## $e3
## [1] 15.12047
##
## $e4
## [1] NA
## $e1
## [1] 4.439524
##
## $e2
## [1] 9.289593
##
## $e3
## [1] 17.19881
##
## $e4
## [1] 496.3777
What’s happening above?
sapply
sapply is a wrapper around lapply: it simplifies the result of lapply if the argument simplify is set to TRUE (the default value).
## e1 e2 e3 e4
## 4.439524 9.289593 17.198810 496.377709
## $e1
## [1] 4.439524
##
## $e2
## [1] 9.289593
##
## $e3
## [1] 17.19881
##
## $e4
## [1] 496.3777
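The effect of the simplify argument can be sketched on a small list. The data below are made up for illustration and are not the list used above.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20, 30))  # hypothetical data

simplified   <- sapply(scores, mean)                    # named numeric vector
unsimplified <- sapply(scores, mean, simplify = FALSE)  # named list, as lapply returns

print(simplified)
print(unsimplified)
```

With simplify = FALSE the result is the same object that lapply(scores, mean) would return.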
mapply
## [[1]]
## [1] 1.926444
##
## [[2]]
## [1] 0.8313486 1.3652517
##
## [[3]]
## [1] 1.9711584 2.6706960 0.3494535
##
## [[4]]
## [1] 1.650246 2.756406 1.461191 2.227292
##
## [[5]]
## [1] 2.492229 2.267835 2.653258 1.877291 1.586323
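mapply is the multivariate version of sapply: it walks over several vectors in parallel and passes the i-th element of each to the function. The call that produced the output above is not shown, so this is a sketch with assumed arguments (`n`, `centre`) and made-up values.

```r
set.seed(123)

# Element i of `n` is paired with element i of `centre`
draws <- mapply(function(n, centre) rnorm(n, mean = centre),
                n = 1:3, centre = c(0, 10, 100))

print(draws)
```

Because the three results have different lengths, mapply cannot simplify them and returns a list.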
A prerequisite for data analysis is having data. Some datasets are “pretty”, that is, they come in tabular format: a little cleaning and we are done. On the other hand, there are unstructured data, typically text-heavy files that demand a huge amount of time before they can be used as input.
Have you ever heard the saying “big rocks first”? Well, we will do the opposite here. Let’s start by showing how to import, create and format tabular datasets.
The easiest data to import into R are spreadsheet-like text files.
## [1] "read.dcf" "readBin" "readChar" "readline"
## [5] "readLines" "readRDS" "readRenviron" "Sys.readlink"
## [1] "read.csv" "read.csv2" "read.delim"
## [4] "read.delim2" "read.DIF" "read.fortran"
## [7] "read.fwf" "read.socket" "read.table"
## [10] "readCitationFile"
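Listings like the two above can be obtained by searching the base and utils packages for functions whose names contain "read". The exact call used originally is not shown; the sketch below is one way to do it, and the result may differ slightly between R versions.

```r
# Objects in base whose name contains "read"
base_readers <- grep("read", ls(baseenv()), value = TRUE)

# Exported objects in utils whose name contains "read"
utils_readers <- grep("read", getNamespaceExports("utils"), value = TRUE)

print(sort(base_readers))
print(sort(utils_readers))
```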
.txt files
Opening a .txt file can easily become a Pandora’s box: you never know if you are about to spread misery in your work for days!
Problems:
| Locale | Format |
|---|---|
| Canadian (English and French) | 4 294 967 295,000 |
| German | 4 294 967.295,000 |
| Italian | 4.294.967.295,000 |
| US-English | 4,294,967,295.00 |
What we see:
What R sees:
## [1] "date\tiso_a3\tcurrency_code\tname\tlocal_price"
## [2] "4/1/2000\tARG\tARS\tArgentina\t2.5"
## [3] "4/1/2000\tAUS\tAUD\tAustralia\t2.59"
## [4] "4/1/2000\tBRA\tBRL\tBrazil\t2.95"
## [5] "4/1/2000\tCAN\tCAD\tCanada\t2.85"
## [6] "4/1/2000\tCHE\tCHF\tSwitzerland\t5.9"
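The \t sequences show that the file is tab-delimited, so the separator and the decimal mark have to be stated when importing. A minimal sketch, assuming a temporary file that contains the first lines shown above (the path and the object name `big_mac` are hypothetical):

```r
# Write a small tab-delimited file to a temporary location
txt_path <- tempfile(fileext = ".txt")
writeLines(c(
  "date\tiso_a3\tcurrency_code\tname\tlocal_price",
  "4/1/2000\tARG\tARS\tArgentina\t2.5",
  "4/1/2000\tAUS\tAUD\tAustralia\t2.59"
), txt_path)

# sep: column separator; dec: decimal mark used in the file
big_mac <- read.delim(txt_path, sep = "\t", dec = ".",
                      stringsAsFactors = FALSE)
str(big_mac)
```

For a file written with a European locale, dec = "," would be needed instead.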
.csv files
In R versions before 4.0.0, read.csv converts character values to factors by default. This is inefficient, since R has to map the values inside the vector and recognize how many different values exist in order to form the levels. Therefore, it is advisable to set stringsAsFactors=FALSE.
How to import a .csv to our environment?
## 'data.frame': 1218 obs. of 19 variables:
## $ date : chr "2000-04-01" "2000-04-01" "2000-04-01" "2000-04-01" ...
## $ iso_a3 : chr "ARG" "AUS" "BRA" "CAN" ...
## $ currency_code: chr "ARS" "AUD" "BRL" "CAD" ...
## $ name : chr "Argentina" "Australia" "Brazil" "Canada" ...
## $ local_price : num 2.5 2.59 2.95 2.85 5.9 ...
## $ dollar_ex : num 1 1.68 1.79 1.47 1.7 ...
## $ dollar_price : num 2.5 1.54 1.65 1.94 3.47 ...
## $ USD_raw : num -0.004 -0.386 -0.343 -0.228 0.383 -0.023 -0.524 -0.446 0.226 -0.051 ...
## $ EUR_raw : num 0.05 -0.352 -0.308 -0.186 0.458 0.03 -0.498 -0.416 0.293 0 ...
## $ GBP_raw : num -0.167 -0.486 -0.451 -0.354 0.156 -0.183 -0.602 -0.537 0.025 -0.207 ...
## $ JPY_raw : num -0.099 -0.444 -0.406 -0.301 0.251 -0.116 -0.569 -0.499 0.11 -0.142 ...
## $ CNY_raw : num 1.091 0.289 0.378 0.622 1.903 ...
## $ GDP_dollar : num NA NA NA NA NA NA NA NA NA NA ...
## $ adj_price : num NA NA NA NA NA NA NA NA NA NA ...
## $ USD_adjusted : num NA NA NA NA NA NA NA NA NA NA ...
## $ EUR_adjusted : num NA NA NA NA NA NA NA NA NA NA ...
## $ GBP_adjusted : num NA NA NA NA NA NA NA NA NA NA ...
## $ JPY_adjusted : num NA NA NA NA NA NA NA NA NA NA ...
## $ CNY_adjusted : num NA NA NA NA NA NA NA NA NA NA ...
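The effect of stringsAsFactors can be checked on a small file. This is a sketch on a made-up temporary file, not the dataset summarized above.

```r
# Write a small comma-separated file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,iso_a3,name,local_price",
  "2000-04-01,ARG,Argentina,2.5",
  "2000-04-01,AUS,Australia,2.59"
), csv_path)

as_factors <- read.csv(csv_path, stringsAsFactors = TRUE)
as_strings <- read.csv(csv_path, stringsAsFactors = FALSE)

str(as_factors)  # character columns stored as factors
str(as_strings)  # character columns kept as chr
```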
- foreign: Reading and writing data stored by some versions of ‘Epi Info’, ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’, ‘Weka’, and for reading and writing some ‘dBase’ files.
- haven: Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files
## re-encoding from CP1252
## [1] 439 134
attributes()
## id sex age
## "" "sex" ""
## marital child educ
## "marital status" "child" "highest educ completed"
haven is extremely useful since it follows the tidy philosophy that is taking hold in R (we will cover this in depth in the next session).
Do you see any difference?
Data are exported with the write* functions. For instance:
write.csv(x = export_obj, file = "datasets/sample1.csv")
write.csv2(x = export_obj, file = "datasets/sample2.csv")
Are sample1 and sample2 equal? Let’s see
## [1] "\"\",\"id\",\"sex\",\"age\"" "\"1\",415,\"FEMALES\",24"
## [1] "\"\";\"id\";\"sex\";\"age\"" "\"1\";415;\"FEMALES\";24"
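The two functions differ in the field separator and in the decimal mark. A small sketch with a made-up data frame (`export_demo` and the `score` column are assumptions) written to temporary files makes both differences visible:

```r
export_demo <- data.frame(id = 415, sex = "FEMALES", age = 24,
                          score = 1.5)  # hypothetical data

csv_path  <- tempfile(fileext = ".csv")
csv2_path <- tempfile(fileext = ".csv")

write.csv(x = export_demo, file = csv_path)    # sep = ",", dec = "."
write.csv2(x = export_demo, file = csv2_path)  # sep = ";", dec = ","

csv_lines  <- readLines(csv_path)
csv2_lines <- readLines(csv2_path)
print(csv_lines)
print(csv2_lines)
```

Files written with write.csv2 should be read back with read.csv2, otherwise every row ends up in a single column.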
Write:
Read:
Write:
Read:
Data analysis workflow (source: Wickham & Grolemund, 2017)
Related packages (not covered):
data.table: Fast aggregation of large data, fast ordered joins, fast add/modify/delete of columns by a group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development.
There are many ways to structure the same underlying data
Structure #2
## rebecca thomas janna
## treatment_a 1 3 4
## treatment_b 2 6 8
Tidy structure
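In a tidy structure every variable is a column and every observation is a row. As a sketch in base R, the wide table of Structure #2 can be reshaped so that person, treatment and result each get their own column; the object and column names below are assumptions.

```r
# Wide layout: one row per treatment, one column per person
wide <- matrix(c(1, 2, 3, 6, 4, 8), nrow = 2,
               dimnames = list(treatment = c("treatment_a", "treatment_b"),
                               person = c("rebecca", "thomas", "janna")))

# Long (tidy) layout: one row per treatment-person combination
tidy <- as.data.frame(as.table(wide), responseName = "value")
print(tidy)
```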
First things first…
x <- survey$age # New intermediary variable
age_2 <- x^2 # Apply the function
(mean_2 <- mean(age_2)) # Calculate the mean of squared age
## [1] 1575.465
Evidently, one could also use the following process
## [1] 1575.465
Cleaner, isn’t it? But can you believe that there is a way to make this more consistent and readable?
So, if we want to get the mean value of the squared age:
## [1] 1575.465
magrittr allows us to write more readable code
Basic pipes
- x %>% f is equivalent to f(x)
- x %>% f(y) is equivalent to f(x, y)
- x %>% f %>% g %>% h is equivalent to h(g(f(x)))
Placeholder
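The pipe rules, and the `.` placeholder that magrittr offers for passing the left-hand side to an argument other than the first, can be sketched as follows. This assumes the magrittr package is installed; `age` and `square` are hypothetical names, and the values are made up.

```r
library(magrittr)  # provides %>%

age <- c(20, 30, 40)              # hypothetical ages
square <- function(v) v^2         # small helper for illustration

nested <- mean(square(age))               # read inside-out
piped  <- age %>% square() %>% mean()     # read left to right

# Placeholder: `.` marks where the left-hand side goes,
# so this is seq(1, 10, by = 2)
steps <- 2 %>% seq(1, 10, by = .)

print(piped)
print(steps)
```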
# Let's create a sample of survey from the SPSS file
(survey_2 <- as_tibble(survey[1:100, c("sex", "age", "educ", "mast1")]))
- mutate() adds new variables that are functions of existing variables.
- select() picks variables based on their names.
- filter() picks cases based on their values.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
- group_by() makes the functions above operate on specific groups of values.
mutate and transmute
- mutate() adds new variables and preserves existing ones.
- transmute() adds new variables and drops existing ones.
Old way:
Tidy way:
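The code for the two approaches is not shown, so the following is only a sketch of what they could look like. It assumes the dplyr package is installed and uses a small made-up tibble (`survey_demo`) instead of the survey data.

```r
library(dplyr)

survey_demo <- tibble(sex = c("FEMALES", "MALES", "FEMALES"),
                      age = c(24, 39, 48))  # hypothetical data

# Old way: base R assignment into a new column
old <- as.data.frame(survey_demo)
old$age_2 <- old$age^2

# Tidy way: mutate() keeps the existing columns and adds the new one
tidy <- survey_demo %>% mutate(age_2 = age^2)

# transmute() keeps only the newly created column
only_new <- survey_demo %>% transmute(age_2 = age^2)

print(tidy)
print(only_new)
```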
select and rename
- select() keeps only the variables you mention.
- rename() keeps all variables.
- Use : to include ranges of variables.
- Use - to exclude them.
- starts_with(): Starts with a prefix.
- ends_with(): Ends with a suffix.
- contains(): Contains a literal string.
- matches(): Matches a regular expression.
- num_range(): Matches a numerical range like x01, x02, x03.
- one_of(): Matches variable names in a character vector.
- everything(): Matches all variables.
- last_col(): Select last variable, possibly with an offset.
filter
- ==, >, >= etc
- &, |, !, xor()
- is.na()
- between(), near()
summarise and group_by
- After group_by(), summarise() will result in one row in the output for each group.
- summarize (with a z) also works.
- mean(), median(), sd(), IQR(), mad(), min(), max(), quantile(), first(), last(), nth(), n(), n_distinct(), any(), all()
- Use ungroup() to clear the grouping.
arrange
Next session:
- tidyr
- purrr